Abstract

Sparse Representation-based Classification (SRC) is a powerful tool for distinguishing signal categories that lie on different subspaces. Despite its wide application to visual recognition tasks, the current understanding of SRC is based solely on a reconstructive perspective, which neither offers any guarantee on its classification performance nor provides any insight into how to design a discriminative dictionary for SRC. In this paper, we present a novel perspective on SRC and interpret it as a margin classifier. The decision boundary and margin of SRC are analyzed in local regions where the support of the sparse code is stable. Based on the derived margin, we propose a hinge loss function as the gauge for the classification performance of SRC. A stochastic gradient descent algorithm is implemented to maximize the margin of SRC and obtain more discriminative dictionaries. Experiments validate the effectiveness of the proposed approach in predicting classification performance and improving dictionary quality over reconstructive ones. Classification results competitive with other state-of-the-art sparse coding methods are reported on several data sets.

1.  Introduction 

Since it was originally proposed for face recognition, Sparse Representation-based Classification (SRC) [24] has received an increasing amount of attention, and it has been successfully used in the classification of various visual signals including facial expressions [6], handwritten digits [25], and general images [5].

In SRC, a test signal $x$ is represented as a sparse linear combination of the atoms in a dictionary $D$ composed of training data from all classes, i.e., $x = D\alpha$. If the signals in each class lie in a low-dimensional subspace and the subspaces of different classes satisfy certain incoherence conditions, it is speculated in [24] that all the nonzero coefficients in the sparse code $\alpha$ will be associated with the dictionary atoms that belong to the same class as $x$. This argument has recently gained more theoretical support from the analysis of sparse subspace clustering in [21], as classification can be regarded as clustering new data into existing clusters with known labels. However, due to noise corruption and subspace overlap, the nonzero coefficients in $\alpha$ are usually associated with atoms from more than one class in practice. This problem is addressed in SRC by predicting the label as the class whose corresponding coefficients give the smallest reconstruction error of $x$. Although such a classification scheme has proven effective empirically in many applications, its working mechanism is obscure and there is no guarantee on the classification performance. Some attempts have been made to attribute the power of SRC to collaborative representation [28], but the analysis is quite limited.

Due to the absence of a feasible performance metric for SRC, the design of its dictionary (which serves as the parameter for both representation and classification) has been more or less heuristic so far. Originally, an SRC dictionary is constructed by directly including all the training samples [24], which is neither efficient nor practical when the training set is huge. Random sampling or clustering methods such as K-means can give a compact dictionary, but generative as well as discriminative capabilities are lost. Traditional dictionary learning methods specialized for sparse representation, such as the Method of Optimal Directions (MOD) [8], K-SVD [1], and the $\ell_1$-relaxed convex formulations [13, 15], all focus on minimizing signal reconstruction error and thus are not optimized for the classification task. In order to promote the discriminative power of dictionaries, recent works have augmented the reconstructive objective function with additional discrimination terms, e.g., the Fisher discriminant criterion [27], structural incoherence [20], class residual difference [16, 25], and mutual information [19]. Classification models other than SRC have also been used with sparse codes as inputs [4, 10, 14]. The discrimination metrics in all the above methods are not geared to the mechanism of SRC; moreover, the use of an extra classification model (often requiring the one-versus-rest paradigm in multi-class cases) will multiply the number of parameters and increase the risk of over-fitting.

In this paper, we present a novel margin-based perspective on SRC and propose a maximum-margin performance metric that is specifically designed for learning the dictionaries of SRC. Large margin classifiers [2] are well studied by the machine learning community, and they have many desirable properties such as robustness to noise and outliers, and a theoretical connection with generalization bounds. Due to the complex nonlinear mapping induced by sparse coding, evaluating the margin of SRC is nontrivial. Based on the local stability of sparse code support, we show in Sec. 2 that the decision boundary of SRC is a continuous piecewise quadratic surface, and the margin of a sample is approximated as its distance to the tangent plane of the decision function in a local region where the support of the sparse code is stable. Following the idea of the Support Vector Machine (SVM), we propose in Sec. 3 to use the hinge loss of the approximated margin as a metric for the classification performance and generalization capability of SRC. A stochastic gradient descent algorithm is then implemented to maximize the margin of SRC and obtain more discriminative dictionaries. To the best of our knowledge, we are the first to conduct margin analysis on SRC and optimize its dictionary by margin maximization. The experiments in Sec. 4 validate the effectiveness of our margin-based loss function in predicting classification performance. It is shown on several data sets that our algorithm can learn very compact dictionaries that attain much higher accuracies than the conventional dictionaries in SRC; the performance is also competitive with other state-of-the-art methods based on sparse coding. Sec. 5 draws conclusions and discusses future work.

2.  Margin  Analysis  of  SRC 
2.1.  Preliminary 

Suppose our data sample $x$ lies in the high-dimensional space $\mathbb{R}^m$ and comes from one of $C$ classes with label $y \in \{1, \dots, C\}$. In SRC, a dictionary $D \in \mathbb{R}^{m \times n}$ with $n$ atoms is composed of $C$ class-wise sub-dictionaries $D_c \in \mathbb{R}^{m \times n_c}$ such that $D = [D_1, \dots, D_C] = [d_1, \dots, d_n]$. Given $D$, we can find the sparse code $\alpha \in \mathbb{R}^n$ for signal $x$ by solving the following LASSO problem:

$$\alpha = \arg\min_{z} \|Dz - x\|_2^2 + \lambda \|z\|_1, \tag{1}$$

where $\lambda > 0$ is a constant. The sparse code can be decomposed into $C$ sub-codes as $\alpha = [\alpha_1; \dots; \alpha_C]$, where each $\alpha_c$ corresponds to the coefficients for sub-dictionary $D_c$. SRC makes its classification decision based on the residual of the signal approximated by the sub-code of each class: $r_c = \|e_c\|_2^2$, where $e_c = D_c \alpha_c - x$ is the reconstruction error vector for class $c$. The class label is then predicted as:

$$\hat{y} = \arg\min_{c} r_c. \tag{2}$$

A more detailed explanation of SRC can be found in [24].
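The classification rule in Eqs. (1)-(2) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the text does not specify a LASSO solver, so a plain ISTA loop is swapped in here, and the names `lasso_ista`, `src_predict`, and `atom_labels` are hypothetical.

```python
import numpy as np

def lasso_ista(D, x, lam, n_iter=500):
    """Approximately solve min_z ||D z - x||_2^2 + lam * ||z||_1 (Eq. 1) with ISTA.

    ISTA is an assumption made for this sketch; any LASSO solver could be used.
    """
    L = 2.0 * np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the smooth term's gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ z - x)
        u = z - grad / L
        z = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)  # soft thresholding
    return z

def src_predict(D, atom_labels, x, lam):
    """Return (predicted label, per-class residuals r_c) following Eq. (2)."""
    alpha = lasso_ista(D, x, lam)
    classes = np.unique(atom_labels)
    residuals = np.array([
        np.sum((D[:, atom_labels == c] @ alpha[atom_labels == c] - x) ** 2)
        for c in classes
    ])
    return classes[np.argmin(residuals)], residuals
```

Here `atom_labels[j]` holds the class of atom $d_j$, so `D[:, atom_labels == c]` plays the role of $D_c$ and the matching slice of `alpha` plays the role of $\alpha_c$.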

2.2.  Local  Decision  Boundary  for  SRC 

To perform margin-based analysis for SRC, we first need to find its classification decision boundary. Consider two classes $c_1$ and $c_2$, and assume the dictionary $D$ is given. The decision function at sample $x$ is simply defined as $f(x) = r_{c_2} - r_{c_1} \gtrless 0$. $f(x)$ can be expanded as:

$$f(x) = 2\,(D_{c_1}\alpha_{c_1} - D_{c_2}\alpha_{c_2})^T x - \|D_{c_1}\alpha_{c_1}\|_2^2 + \|D_{c_2}\alpha_{c_2}\|_2^2. \tag{3}$$
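For completeness, the expansion of the decision function follows directly from the definition of the class residuals in Sec. 2.1. The short derivation below uses only symbols already introduced; the $\|x\|_2^2$ terms cancel in the difference:

```latex
\begin{aligned}
r_c &= \|D_c\alpha_c - x\|_2^2
     = \|D_c\alpha_c\|_2^2 - 2\,(D_c\alpha_c)^T x + \|x\|_2^2, \\
f(x) &= r_{c_2} - r_{c_1}
      = 2\,(D_{c_1}\alpha_{c_1} - D_{c_2}\alpha_{c_2})^T x
        - \|D_{c_1}\alpha_{c_1}\|_2^2 + \|D_{c_2}\alpha_{c_2}\|_2^2 .
\end{aligned}
```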

Eq. (3) could be regarded as a linear hyperplane in the space of data $x$ if the sparse code $\alpha$ were fixed. What complicates things here is that $\alpha$ is also determined by $x$ through the sparse coding model in (1), and the hyperplane in (3) will change with any small change in $x$. Expressing $\alpha$ analytically as a function of $x$ is not possible in general, unless we know a priori the support and sign vector of $\alpha$. In the latter case, the nonzero part of $\alpha$ can be found according to the optimality condition of the LASSO solution [29]:

$$\alpha_\Lambda = (D_\Lambda^T D_\Lambda)^{-1}\left(D_\Lambda^T x - \tfrac{\lambda}{2}\, s_\Lambda\right), \tag{4}$$

where $\Lambda = \{j \mid \alpha_j \neq 0\}$ is the active set of sparse coefficients with cardinality $|\Lambda| = \|\alpha\|_0 = s$, $\alpha_\Lambda \in \mathbb{R}^s$ contains the sparse coefficients at these active locations, $D_\Lambda \in \mathbb{R}^{m \times s}$ is composed of the columns in $D$ corresponding to $\Lambda$, and $s_\Lambda \in \mathbb{R}^s$ carries the signs ($\pm 1$) of $\alpha_\Lambda$. Although the active set $\Lambda$ and sign vector $s_\Lambda$ also depend on $x$, it can be shown (in the supplementary material) that they are locally stable if $x$ changes by a small amount $\Delta x$ satisfying the following stability conditions:

$$\begin{cases} \left| d_j^T \left\{ e + \left[ D_\Lambda (D_\Lambda^T D_\Lambda)^{-1} D_\Lambda^T - I \right] \Delta x \right\} \right| < \tfrac{\lambda}{2}, & \forall j \notin \Lambda \\ s_\Lambda \odot \left[ (D_\Lambda^T D_\Lambda)^{-1} D_\Lambda^T \Delta x \right] > -\, s_\Lambda \odot \alpha_\Lambda \end{cases} \tag{5}$$

where $\odot$ denotes element-wise multiplication, and $e = D_\Lambda \alpha_\Lambda - x$ is the global reconstruction error. All the conditions in (5) are linear inequalities in $\Delta x$. Therefore, the local neighborhood around $x$ where the active set (and signs¹) of the signal's sparse code remains stable is a convex polytope.
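As a rough illustration of Eqs. (4) and (5), the sketch below evaluates the closed-form active coefficients and then tests whether a given perturbation keeps the support and signs unchanged. The function names and argument layout (`active` as an index array, `s_act` as the sign vector) are assumptions made for this example, not part of the original method description.

```python
import numpy as np

def active_code(D_act, s_act, x, lam):
    """Closed-form nonzero coefficients of Eq. (4) for a known active set and signs."""
    G = D_act.T @ D_act
    return np.linalg.solve(G, D_act.T @ x - 0.5 * lam * s_act)

def support_is_stable(D, active, s_act, x, dx, lam):
    """Check the linear conditions of Eq. (5) for a perturbation dx of x."""
    D_act = D[:, active]
    alpha_act = active_code(D_act, s_act, x, lam)
    e = D_act @ alpha_act - x                              # global reconstruction error
    G_inv_Dt = np.linalg.solve(D_act.T @ D_act, D_act.T)   # (D_act^T D_act)^{-1} D_act^T
    e_new = e + (D_act @ G_inv_Dt - np.eye(D.shape[0])) @ dx
    inactive = np.setdiff1d(np.arange(D.shape[1]), active)
    # First condition: inactive atoms stay below the activation threshold lam / 2.
    cond_inactive = np.all(np.abs(D[:, inactive].T @ e_new) < 0.5 * lam)
    # Second condition: active coefficients keep their signs.
    cond_signs = np.all(s_act * (G_inv_Dt @ dx) > -s_act * alpha_act)
    return bool(cond_inactive and cond_signs)
```

Because both conditions are linear in `dx`, the set of perturbations for which the function returns `True` is the convex polytope described in the text.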

Now substitute the sparse code terms in (3) with (4), and after some manipulations we obtain a quadratic local decision function $f_\Lambda(x)$, which is defined for any $x$ whose sparse code corresponds to active set $\Lambda$:

$$f_\Lambda(x) = x^T \Phi_{c_2}^T \Phi_{c_2} x + 2 v_{c_2}^T \Phi_{c_2} x + v_{c_2}^T v_{c_2} - \left( x^T \Phi_{c_1}^T \Phi_{c_1} x + 2 v_{c_1}^T \Phi_{c_1} x + v_{c_1}^T v_{c_1} \right), \tag{6}$$

where

$$\Phi_c = D_\Lambda P_c (D_\Lambda^T D_\Lambda)^{-1} D_\Lambda^T - I, \tag{7}$$

$$v_c = -\tfrac{\lambda}{2}\, D_\Lambda P_c (D_\Lambda^T D_\Lambda)^{-1} s_\Lambda, \tag{8}$$

and $P_c$ is an $s \times s$ diagonal matrix with 1 at positions corresponding to class $c$ in the active set and 0 otherwise. The above analysis leads to the following proposition for the decision function of SRC.

¹ In the following, the concept of sign vector $s_\Lambda$ is included by default when we refer to "active set" or "$\Lambda$".

Proposition 2.1 The decision function of SRC is a piecewise quadratic function of the input signal with the form

$$f(x) = f_\Lambda(x), \tag{9}$$

for any $x$ in the convex polytope defined by Eq. (5) where the active set $\Lambda$ of its sparse code is stable.
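The local quadratic form can be checked numerically. The sketch below builds $\Phi_c$ and $v_c$ from Eqs. (7)-(8) and evaluates $f_\Lambda(x)$ through the identity $e_c = \Phi_c x + v_c$, which follows from substituting Eq. (4) into $e_c = D_\Lambda P_c \alpha_\Lambda - x$. The function names and the `act_labels` array (class label of each active atom) are hypothetical.

```python
import numpy as np

def local_quadratic_terms(D_act, act_labels, s_act, c, lam):
    """Phi_c and v_c of Eqs. (7)-(8) for class c on a fixed active set."""
    m = D_act.shape[0]
    P_c = np.diag((act_labels == c).astype(float))  # selects active atoms of class c
    G_inv = np.linalg.inv(D_act.T @ D_act)
    Phi_c = D_act @ P_c @ G_inv @ D_act.T - np.eye(m)
    v_c = -0.5 * lam * D_act @ P_c @ G_inv @ s_act
    return Phi_c, v_c

def local_decision(D_act, act_labels, s_act, x, c1, c2, lam):
    """f_Lambda(x) = r_{c2} - r_{c1}, with r_c = ||Phi_c x + v_c||_2^2 (Eq. 6)."""
    r = {}
    for c in (c1, c2):
        Phi_c, v_c = local_quadratic_terms(D_act, act_labels, s_act, c, lam)
        e_c = Phi_c @ x + v_c
        r[c] = float(e_c @ e_c)
    return r[c2] - r[c1]
```

Evaluating `local_decision` and comparing it with residuals computed directly from $\alpha_\Lambda$ in Eq. (4) is a quick way to confirm that the two expressions agree for a fixed active set.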

Since there is a set of quadratic decision functions, each devoted to a local area of $x$, SRC is capable of classifying data which cannot be linearly or quadratically separated in a global sense. The decision boundary of SRC can be adapted to each local area in the most discriminative and compact way, which shares a similar idea with locally adaptive metric learning [7]. On the other hand, these quadratic functions as well as the partition of local areas cannot be adjusted individually; they are all tied via a common model $D$. This is crucial to reduce model complexity and enhance information sharing among different local regions, considering there could be as many as $3^n$ regions² in the partition of the entire signal space.

To find the decision boundary of SRC, we simply need to check at what values of $x$ the function $f(x)$ changes from positive to negative, as the decision threshold is 0. It has been shown in [29] that the sparse code $\alpha$ is a continuous function of $x$. Thus we can easily see that $f(x)$ is also continuous over the entire domain of $x$, and the points on the decision boundary of SRC have to satisfy $f(x) = 0$, which is stated in the following proposition.

Proposition 2.2 The decision boundary of SRC is a piecewise quadratic hypersurface defined by $f(x) = 0$.

2.3.  Margin  Approximation  for  SRC 

For linear classifiers, the margin of a sample is defined as its distance from the decision hyperplane. In the context of SRC, we similarly define the margin of a sample $x_0$ as its distance to the closest point on the decision boundary: $\min_{f(x)=0} \|x_0 - x\|_2$. Unfortunately, due to the complexity of SRC's decision function $f(x)$, it is difficult to evaluate the associated margin directly.

Instead, we estimate the margin of $x_0$ by approximating $f(x)$ with its tangent plane at $x_0$. Such an approximation is appropriate only when the gradient $\nabla f(x)$ does not change too much as $f(x)$ descends from $f(x_0)$ to 0, which is generally true based on the following observations. First, within each polytope for a stable active set $\Lambda$, $\nabla f_\Lambda(x)$ is a linear function of $x$ and will not change a lot if $x_0$ lies close to the boundary. Second, as implied by the empirical findings in Fig. 1, if we have two contiguous polytopes corresponding respectively to two stable active sets, $\Lambda_1$ and $\Lambda_2$, which are the same except for one entry, then with a high probability the gradient of the decision function in the two polytopes will be approximately the same near their border: $\nabla f_{\Lambda_1} \approx \nabla f_{\Lambda_2}$. This observation allows us to approximate the decision function over a number of polytopes with a common tangent plane. Third, as will be discussed in Sec. 3, we are more interested in the data samples near the decision boundary when optimizing a large margin classifier. Thus, those faraway samples whose margins cannot be accurately approximated can be safely ignored. Therefore, our approximation is also suitable for use with margin maximization.

² Each atom can be assigned a positive, negative, or zero coefficient.

[Figure 1 panels: (a) correlation of gradient directions; (b) ratio of gradient magnitudes.]

Figure 1. Histograms of the (a) correlation and (b) magnitude ratio between the decision function gradients $\nabla f_{\Lambda_1}$ and $\nabla f_{\Lambda_2}$ on the MNIST data set. $\nabla f_{\Lambda_1}$ is the gradient at the original data $x$, and $\nabla f_{\Lambda_2}$ is the gradient at data with a small perturbation $\Delta x$ from $x$, such that only one of the conditions in Eq. (5) is violated. Both (a) and (b) are highly peaked around 1.

Once the decision function $f(x)$ is linearly approximated, the margin $\gamma$ of $x_0$ is simply its distance (with sign) to the hyperplane $f(x) = 0$:

$$\gamma(x_0) = \frac{f(x_0)}{\|\nabla f(x_0)\|_2} = \frac{f_\Lambda(x_0)}{\|\nabla f_\Lambda(x_0)\|_2} = \frac{r_{c_2} - r_{c_1}}{2\,\|\Phi_{c_2}^T e_{c_2} - \Phi_{c_1}^T e_{c_1}\|_2}, \tag{10}$$

where we use the relationship $e_c = \Phi_c x_0 + v_c$ to simplify the expression in (10); all the $\Phi_c$'s and $v_c$'s are calculated according to (7) and (8) with the active set $\Lambda$ of the sparse code of $x_0$. It should be noted that the decision function gradient $\nabla f$ is not defined on the borders of convex polytopes with different active sets. In such a case, we just replace $\|\nabla f\|_2$ with the largest directional derivative evaluated in all the pertinent polytopes.

In SRC, all data samples are usually normalized onto the unit ball such that $\|x\|_2 = 1$. In this way, the change of $f(x)$ in the direction of $x_0$ itself should not be taken into account when we calculate the margin of $x_0$. The margin expression can be further amended as

$$\gamma(x_0) = \frac{f(x_0)}{\|M \nabla f(x_0)\|_2} = \frac{r_{c_2} - r_{c_1}}{2\,\|M(\Phi_{c_2}^T e_{c_2} - \Phi_{c_1}^T e_{c_1})\|_2}, \tag{11}$$

where $M = I - x_0 x_0^T$.

[Figure 2 plot; horizontal axis: change of $x$ from $x_0$ in the direction of $\nabla f(x_0)$.]

Figure 2. Top: decision function $f(x)$ for class "7" against class "4" in the MNIST data set and its approximations, where $x$ changes in the 1D neighborhood of a sample $x_0$ in the direction of gradient $\nabla f(x_0)$. Bottom: the images of $x$ as it moves in the direction of $\nabla f(x_0)$ (from left to right). The central image corresponds to the original sample $x_0$.
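A minimal sketch of the projected margin in Eq. (11) is given below, assuming the active set, its class labels, and its sign vector have already been obtained from the sparse code of $x_0$. The function name `approx_margin` and its argument layout are hypothetical.

```python
import numpy as np

def approx_margin(D_act, act_labels, s_act, x0, c1, c2, lam):
    """Approximate margin of Eq. (11) for a unit-norm sample x0 on a fixed active set."""
    m = D_act.shape[0]
    G_inv = np.linalg.inv(D_act.T @ D_act)
    grad_half = np.zeros(m)  # accumulates Phi_c2^T e_c2 - Phi_c1^T e_c1
    r = {}
    for c, sign in ((c2, 1.0), (c1, -1.0)):
        P_c = np.diag((act_labels == c).astype(float))
        Phi_c = D_act @ P_c @ G_inv @ D_act.T - np.eye(m)   # Eq. (7)
        v_c = -0.5 * lam * D_act @ P_c @ G_inv @ s_act       # Eq. (8)
        e_c = Phi_c @ x0 + v_c
        r[c] = float(e_c @ e_c)
        grad_half += sign * (Phi_c.T @ e_c)
    M = np.eye(m) - np.outer(x0, x0)  # removes the gradient component along x0
    return (r[c2] - r[c1]) / (2.0 * np.linalg.norm(M @ grad_half))
```

For a toy case with `D_act = np.eye(2)`, one atom per class, positive signs, and `x0 = [0.8, 0.6]`, the residual difference is 0.28 and the projected gradient norm is 1.92, so the returned margin is about 0.146; swapping `c1` and `c2` flips its sign.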

Fig. 2 graphically illustrates our margin approximation approach for one image sample $x_0$ from class "7" in the MNIST digits data set. We evaluate the ground truth value of the decision function $f(x)$ at a series of data points $x$ in a 1D interval generated by shifting $x_0$ along the direction of $\nabla f(x_0)$, and record all the points where the active set of the sparse code changes. We can see that the piecewise smooth $f(x)$ (plotted as a red curve) can be well approximated by the tangent of the local quadratic decision function (green asterisk) in a neighborhood where the active set (whose stable region is delimited by red plus signs) does not change too much. However, the linear approximation (blue cross) suggested by Eq. (3) is much less accurate, though they all intersect at point $x_0$. The margin (indicated by the golden arrow) we find for this example is very close to its true value. Fig. 2 also shows how the appearance of the image signal is distorted from its true class "7" to the imposter class "4" as it moves along the gradient of the decision function.

3.  Maximum-Margin  Dictionary  Learning 

The concept of maximum margin has been widely employed in training classifiers, and it serves as the cornerstone of many popular models including SVM. The classical analysis on SVM [22] established the relationship between the margin of the training set and the classifier's generalization error bound. Recently, a similar effort has been made for sparsity-based linear predictive classifiers [18], which motivates us to design the dictionary for SRC from a perspective based on the margin analysis given in Sec. 2.

Suppose we have a set of $N$ labeled training data samples: $\{x_i, y_i\}_{i=1,\dots,N}$. Learning a discriminative dictionary $D^*$ for SRC can be generally formulated as the following optimization problem:

$$D^* = \arg\min_{D \in \mathcal{D}} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(x_i, y_i; D), \tag{12}$$

where $\mathcal{D}$ denotes the dictionary space with unit-norm atoms. To maximize the margin of a training sample close to the decision boundary of SRC, we follow a similar idea to SVM and define the loss function $\mathcal{L}(x, y; D)$ using a multi-class hinge function:

$$\mathcal{L}(x, y; D) = \sum_{c \neq y} \max\{0, -\gamma(x, y, c) + b\}, \tag{13}$$

where $b$ is a non-negative parameter controlling the minimum required margin between classes, and

$$\gamma(x, y, c) = \frac{r_c - r_y}{2\,\|M(\Phi_c^T e_c - \Phi_y^T e_y)\|_2} \tag{14}$$

is the margin of sample $x$ with label $y$ calculated against a competing class $c \neq y$, which is adapted from Eq. (11). The loss function in (13) is zero if the sample margin is equal to or greater than $b$; otherwise, it gives a penalty that grows linearly as the margin falls below $b$. Different from what is defined in SVM, the margin we use here is unnormalized since the unit-norm constraint on dictionary atoms ensures that the objective function is bounded. Moreover, (13) promotes a multi-class margin by summing over all possible imposter classes $c$ and optimizing the single parameter $D$ that is shared by all classes. This offers an advantage over a set of one-versus-rest binary classifiers whose margins can only be optimized separately.
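The multi-class hinge loss of Eq. (13) is simple to evaluate once the margins against all competing classes are available. The sketch below assumes the margins are passed in as a dictionary keyed by competing class; that data layout and the function name are choices made for this example only.

```python
def hinge_loss(margins, b):
    """Multi-class hinge loss of Eq. (13).

    `margins` maps each competing class c != y to gamma(x, y, c);
    only classes whose margin falls below b contribute to the loss.
    """
    return sum(max(0.0, b - gamma) for gamma in margins.values())
```

For example, with `b = 0.1` and margins `{1: 0.3, 2: 0.05, 3: -0.1}`, class 1 contributes nothing, class 2 contributes 0.05, and the misclassified class 3 contributes 0.2.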

According to the numerator in (14), the residual difference between the competing and correct classes, $r_c - r_y$, should be maximized to achieve a large margin. Such a requirement is consistent with the classification scheme in (2), and it has also been enforced in other dictionary learning algorithms such as [16]. In addition, we further introduce a novel term in the denominator of (14), which normalizes the nonuniform gradient of the SRC decision function in different local regions and leads to a better estimate of the true sample margin.


3.1.  Online  Dictionary  Learning 

We solve the optimization problem in Eq. (12) using an online algorithm based on the stochastic gradient descent method, which is usually favored when the objective function is an expectation over a large number of training samples [15]. In our algorithm, the dictionary is first initialized with a reasonable guess $D^0$ (which can be the concatenation of sub-dictionaries obtained by applying K-means or random selection to each class). Then we go through the whole data set multiple times and iteratively update the dictionary with a decreasing step size until convergence. In the $t$-th iteration, a single sample $(x, y)$ is drawn from the data set randomly and the dictionary is updated along the negative gradient of its loss function:

$$D^t = D^{t-1} - \rho^t \left[ \nabla_D \mathcal{L}(x, y; D^{t-1}) \right]^T, \tag{15}$$

where $\rho^t$ is the step size at iteration $t$, which decreases with $t$ from an initial step size $\rho^0$. The gradient of our loss function is

$$\nabla_D \mathcal{L}(x, y; D) = - \sum_{c \in \mathcal{C}(x, y)} \nabla_D \gamma(x, y, c), \tag{16}$$

where $\mathcal{C}(x, y) = \{c \mid c \neq y,\ \gamma(x, y, c) < b\}$. We ignore those competing classes whose loss term has zero gradient ($\gamma(x, y, c) > b$) or admits a zero sub-gradient ($\gamma(x, y, c) = b$). The latter case occurs with very low probability in practice and thus will not affect the convergence of stochastic gradient descent as long as a suitable step size is chosen [3].

All that remains to be evaluated is $\nabla_D \gamma(x, y, c)$, which can be obtained by taking the derivative of Eq. (14) with respect to $D$. We realize from the results in [18] that the active set $\Lambda$ for any particular sample $x$ is stable when a small perturbation is applied to dictionary $D$. Since the approximation of the margin is also based on a locally stable $\Lambda$, we can safely deduce that $\gamma(x, y, c)$ is a differentiable function of $D$. In this way, we circumvent the non-differentiability encountered when directly taking the derivative of the sparse code with respect to $D$, as has been done in [14, 26]. In addition, since (14) depends only on $D_\Lambda$, we just need to update those dictionary atoms corresponding to the active set $\Lambda$ of each sample $x$. The dictionary updating rule in (15) can be rewritten as:

$$D_\Lambda^t = D_\Lambda^{t-1} + \rho^t \left[ \nabla_{D_\Lambda} \gamma(x, y, c) \right]^T, \tag{17}$$

which is repeated for all $c \in \mathcal{C}(x, y)$. The specific form of $\nabla_{D_\Lambda} \gamma(x, y, c)$ is given in the supplementary material. Once the dictionary is updated in the current iteration, all its atoms are projected onto the unit ball to comply with the constraint that $D \in \mathcal{D}$. The overall Maximum-Margin Dictionary Learning (MMDL) approach is summarized in Algorithm 1.
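The per-sample update can be sketched as follows. Since the analytic form of the margin gradient is only given in the supplementary material, the sketch takes it as an input; `mmdl_step`, `violating_classes`, and the convention that the supplied gradient has the same shape as $D_\Lambda$ are all assumptions of this example.

```python
import numpy as np

def violating_classes(margins, b):
    """Competing classes with margin below b, i.e. the set C(x, y) in Eq. (16)."""
    return [c for c, gamma in margins.items() if gamma < b]

def mmdl_step(D, active, grad_gamma_act, rho):
    """One update of Eq. (17) on the active atoms, then re-normalization.

    grad_gamma_act is assumed to have the same shape as D[:, active]; it is the
    margin gradient w.r.t. the active atoms, supplied by the caller.
    """
    D = D.copy()                                   # leave the input dictionary untouched
    D[:, active] += rho * grad_gamma_act           # gradient ascent on the margin
    D[:, active] /= np.linalg.norm(D[:, active], axis=0)  # d_j <- d_j / ||d_j||
    return D
```

In a full training loop, `mmdl_step` would be called once for every class returned by `violating_classes`, with the step size `rho` decreasing over iterations as described above.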

3.2.  Interpreting  the  Learning  Algorithm 

The gradient term in (17) takes a very complicated form as given in the supplementary material. Nevertheless, some intuition about how our algorithm works can be obtained from its expression. We first notice that Eq. (17) will add $e_c$ to all the active atoms associated with class $c$ and subtract $e_y$ from all the active atoms associated with class $y$, both with a scaling factor proportional to each atom's sparse coefficient. Such an operation effectively "pulls" those active atoms of the correct class towards the signal, and "pushes" those active atoms of the imposter class away from the signal, which is similar to the strategies used to optimize the codebook in Learning Vector Quantization (LVQ) [11] and Large Margin Nearest Neighbor (LMNN) [23]. In addition, (17) also uses the overall reconstruction error $e$ and the projections of $e_c$ and $e_y$ as the ingredients to update the active atoms from all the classes, which is reasonable because the sparse code is jointly determined by all the active atoms.

Algorithm 1 Maximum-Margin Dictionary Learning (MMDL) for SRC

Input: labeled data set $S = \{x_i, y_i\}$, dictionary size $n$, sparse regularization $\lambda$, required margin $b$
Output: dictionary $D$

initialize $D$ with all class-wise sub-dictionaries $D_c$ (obtained using K-means)
set $t = 1$
while not converged do
    randomly permute data set $S$
    for each $(x, y) \in S$ do
        for each $c$ in $\mathcal{C}(x, y)$ do
            update $D_\Lambda$ according to Eq. (17)
        end for
        $d_j \leftarrow d_j / \|d_j\|$ for each $j \in \Lambda$
        $t \leftarrow t + 1$
    end for
end while
return $D$

On the other hand, we observe from Eq. (16) that only those difficult samples that have small margins against the imposter classes are selected to participate in dictionary training. Similar sample selection schemes are also found in LVQ and LMNN. Therefore, our choice of hinge loss function is supported from the perspective of other previously developed large-margin classifiers.

4.  Experimental  Results 
4.1.  Algorithm  Analysis 

To  get  a  better  understanding  of  the  proposed  method, 
we  first  conduct  some  analysis  on  its  behavior  in  this  section 
using  a  subset  of  20,000  training  samples  from  the  MNIST 
[12]  digits  data  set. 

The accuracy of the SRC margin approximation, which is key to the effectiveness of our method, is first investigated. Because it is impossible to find the exact margin of a sample $x_0$, we use the shortest distance between $x_0$ and the decision boundary in the gradient direction $\nabla f(x_0)$ as a surrogate for the ground truth margin. Such a "directional margin" is found by a line search and plotted in Fig. 3 against the estimated margin $\gamma(x_0)$ using Eq. (11) for all the samples. A strong linear relationship is observed between the directional and estimated margins, especially for those samples with small margins, which are precisely the ones of interest to our algorithm. We also plot the distribution of the residual difference $r_{c_2} - r_{c_1}$, which shows a weaker correlation with the directional margin. This justifies that maximizing $\gamma(x)$ as defined in (14) is a better choice than simply maximizing the residual difference for large-margin optimization.

Figure 3. Distributions of the estimated margin $\gamma(x)$ and the residual difference $r_{c_2} - r_{c_1}$ plotted against the directional margin measured in the gradient direction $\nabla f(x)$.

Figure 4. The objective function on the training set and recognition accuracy on the test set during the iterations of the MMDL algorithm.
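The directional margin used as a surrogate above can be sketched as a one-dimensional search for the first sign change of the decision function along the gradient direction. The text only states that a line search is used; the coarse scan followed by bisection, the parameter names, and the default search range below are assumptions of this example.

```python
import numpy as np

def directional_margin(f, x0, direction, t_max=2.0, n_steps=200, tol=1e-8):
    """Signed distance from x0 to the first sign change of f along +/- direction.

    f is any callable decision function; the search moves so that f heads toward 0.
    Returns None if no boundary crossing is found within distance t_max.
    """
    u = direction / np.linalg.norm(direction)
    s0 = np.sign(f(x0))
    if s0 == 0:
        return 0.0                      # x0 already lies on the boundary
    step = -s0 * u                      # descend if f(x0) > 0, ascend otherwise
    ts = np.linspace(0.0, t_max, n_steps + 1)
    for lo, hi in zip(ts[:-1], ts[1:]):
        if np.sign(f(x0 + hi * step)) != s0:
            while hi - lo > tol:        # bisection inside the bracketing interval
                mid = 0.5 * (lo + hi)
                if np.sign(f(x0 + mid * step)) == s0:
                    lo = mid
                else:
                    hi = mid
            return float(s0 * 0.5 * (lo + hi))
    return None
```

The sign of the returned value matches the sign of `f(x0)`, so correctly classified samples get a positive directional margin under the convention of Eq. (10).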

The behavior of our MMDL algorithm is examined in Fig. 4. The objective function value over the training samples decreases steadily and converges in about 70 iterations. At the same time, the recognition accuracy on a separate test set is remarkably improved during the iterations, indicating a good correspondence between our margin-based objective function and SRC's generalization performance.³

The minimum required margin $b$ in Eq. (13) is an important parameter in MMDL, whose effect on recognition performance is shown in Table 1. Too small a value of $b$ leads to over-fitting to the training set, while too large a value biases the classification objective. We find $b = 0.1$ is generally a good choice on different data sets, and gradually reducing $b$ during the iterations can help the algorithm focus more on those hard samples near the decision boundary.

³ We do observe some small fluctuations in the testing accuracy, which are caused by the stochastic gradient descent.

Table 1. The effect of parameter $b$ on classification accuracy.

b                0        0.05     0.1      0.15     0.2
train acc. (%)   100.00   100.00   99.44    98.45    97.39
test acc. (%)    96.78    98.01    98.13    97.36    96.77

Figure 5. Dictionary atoms for MNIST digits data, learned using unsupervised sparse coding (rows 1, 3) and MMDL (rows 2, 4).

Figure 6. Correlation between the first principal component of atoms from different dictionaries and the LDA directions of the MNIST training data (horizontal axis: LDA dimension).

The image patterns of some dictionary atoms obtained using MMDL are shown in Fig. 5, together with those obtained using unsupervised sparse coding [13], which were used to initialize the dictionary in MMDL. The discriminative atoms trained with MMDL look quite different from their initial reconstructive appearances, and place more emphasis on local edge features that are unique to each class. The discriminative power of our learned dictionary can be further demonstrated in Fig. 6, which shows that, compared with K-means and unsupervised sparse coding, the MMDL algorithm learns dictionary atoms whose first principal component has a much higher correlation with most of the LDA directions (especially the first one) of the training data. Although LDA directions may not be optimal for SRC, our dictionary atoms appear to be more devoted to discriminative features instead of reconstructive ones.

4.2.  Recognition  Performance  Evaluation 

Now  we  report  the  recognition  performance  of  the  pro¬ 
posed  method  on  several  benchmark  data  sets.  SRC  is  most 
well  known  for  face  recognition  task,  therefore  we  first  test 


Table  2.  Recognition  accuracies  (%)  on  face  databases. 


Method 

Extended 

YaleB 

AR  Eace 

Full 

97.34 

96.50 

Subsample 

91.20 

73.17 

KSVD  [1] 

88.63 

90.00 

Kmeans 

95.44 

90.00 

Unsup  [13] 

96.35 

90.33 

LC-KSVD  [10] 

95.00 

93.70 

MMDL 

97.34 

97.33 

Error  reduction  (%) 

27.12 

72.39 

Table  3.  Performance  of  SRC  on  the  MNIST  digits  database. 


Training  method  /  Size  of  D 

Accuracy  (%) 

Subsample  /  30000 

98.05 

Subsample  /  150 

82.19 

Kmeans  /  150 

94.19 

Unsup  [13]  /  150 

94.84 

Ramirez  et  al.  [20]  /  800 

98.74 

MMDL  /  150 

98.76 

Error  reduction  (%) 

75.97 

on two face data sets: extended YaleB [9] and AR face [17].
We use 2,414 face images of 38 subjects from the extended
YaleB data set, and a subset containing 2,600 images
of 50 female and 50 male subjects from the AR face data
set. We follow the procedure in [10] to split the training
and test data, and obtain randomly projected face features
of dimension 504 (540) for extended YaleB (AR face). For
both data sets, we compare the performance of SRC with
dictionaries obtained from the full training set (“Full”), random
subsampling of the training set (“Subsample”), KSVD [1],
K-means (“Kmeans”), unsupervised sparse coding (“Unsup”)
[13], and our MMDL algorithm. A comparison with
the state-of-the-art results of LC-KSVD [10], which employs
a linear classification model on sparse codes, is also given.
For extended YaleB (AR face), 15 (5) atoms per subject are
used for all the dictionaries except for “Full”, and λ is set
to 0.01 (0.005). As shown in Table 2, MMDL achieves the
highest accuracies on both data sets, and outperforms the
“Full” SRC on AR face using a much smaller dictionary.
The large reduction in the error rate of MMDL with respect
to its initialization given by “Unsup” further confirms
the effectiveness of our learning algorithm.
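
The error reduction in the last row of Table 2 is consistent with the relative decrease in error rate from the “Unsup” initialization to MMDL. A minimal sketch of this arithmetic follows; the function name error_reduction is our own illustrative choice.

```python
def error_reduction(acc_init, acc_final):
    """Relative reduction (%) in error rate, given two accuracies in percent."""
    err_init = 100.0 - acc_init    # error rate of the initial dictionary
    err_final = 100.0 - acc_final  # error rate of the learned dictionary
    return 100.0 * (err_init - err_final) / err_init

# Accuracies of "Unsup" and MMDL taken from Table 2.
print(f"{error_reduction(96.35, 97.34):.2f}")  # extended YaleB: prints 27.12
print(f"{error_reduction(90.33, 97.33):.2f}")  # AR face: prints 72.39
```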

Our method is also evaluated on the full MNIST data
set, which contains 60,000 images for training and 10,000
for testing. We use PCA to reduce the dimension of each
image such that 99% of the energy is preserved, and set λ = 0.1.
Table 3 lists the classification accuracies of SRC with dictionaries
trained using various methods and with different
sizes. MMDL is shown to be advantageous over other methods
in terms of both accuracy and dictionary compactness,


Figure 7. Distributions of correctly and incorrectly classified test
samples plotted against estimated margin and reconstruction residual
using the atoms from predicted class.


Figure 8. Two misclassified samples corresponding to the red
crosses marked by (a) and (b) in Fig. 7. From left to right: original
sample; reconstruction with atoms of the predicted class; reconstruction
with atoms of the true class; sparse coefficients.


the latter of which implies higher efficiency in computation
as well as storage. Note that we are unable to evaluate SRC
with the “Full” setting because the memory required to
operate on such a large dictionary exceeds our system
capacity (32 GB).
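
The PCA preprocessing described above for MNIST can be sketched as follows. This is an illustration under our own assumptions rather than the exact preprocessing code: "energy" is interpreted as the fraction of total variance of the mean-centered training data, and the function name pca_keep_energy is hypothetical.

```python
import numpy as np

def pca_keep_energy(X, energy=0.99):
    """X: (n_samples, dim). Returns the data mean and a (dim, k) PCA basis,
    with k the smallest number of components retaining `energy` of the variance."""
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    var = s ** 2                          # variance captured by each component
    cum = np.cumsum(var) / var.sum()      # cumulative energy fraction
    k = int(np.searchsorted(cum, energy)) + 1
    k = min(k, len(s))                    # guard against round-off at energy = 1
    return mean, vt[:k].T

# Usage: fit on training data, then project any sample with the same mean/basis.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 3)) * np.array([10.0, 0.5, 0.01])
mean, basis = pca_keep_energy(X_train, energy=0.99)
Z_train = (X_train - mean) @ basis        # reduced-dimension features
```

The same mean and basis estimated from the training set would be applied to the test samples, so that dictionary atoms and test signals live in the same reduced space.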

Fig. 7 reveals the distinct distributions of correctly and
incorrectly classified samples in terms of the estimated margin
and the reconstruction residual with the predicted class. The
incorrectly classified samples are observed to have higher
residuals and smaller margins, which is expected since hard
samples typically cannot be well represented by their
corresponding classes and lie close to the boundaries of
imposter classes. This provides further evidence of the accuracy
of our margin estimation. The estimated margin can therefore
also serve as a metric of classification confidence, based on
which the classification results could be further refined.
Two failed test samples are illustrated in Fig. 8. The
digit “7” in (a) is misclassified as “2” with a large margin
due to strong inter-class similarity and high intra-class
variation that are insufficiently captured by the training set.
The digit “5” in (b) cannot be faithfully represented by any class;
such an outlier has a very small margin and thus can
potentially be detected for special treatment.
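
The idea of using a confidence score to flag doubtful predictions can be illustrated with the sketch below. It rests on several assumptions of our own: the sparse code is obtained with a plain ISTA solver, and the gap between the two smallest class-wise residuals is used as a simple stand-in for confidence. This gap is not the margin estimate derived in this paper, and all function names and the threshold are hypothetical.

```python
import numpy as np

def ista_lasso(D, x, lam, n_iter=500):
    """Approximately solve min_a 0.5*||x - D a||^2 + lam*||a||_1 with ISTA."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1 / Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - step * (D.T @ (D @ a - x))   # gradient step on the quadratic term
        a = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)  # soft threshold
    return a

def src_predict(D, atom_labels, x, lam):
    """Return the SRC label and the gap between the two smallest class residuals."""
    a = ista_lasso(D, x, lam)
    classes = np.unique(atom_labels)
    residuals = np.array([
        np.sum((x - D[:, atom_labels == c] @ a[atom_labels == c]) ** 2)
        for c in classes
    ])
    order = np.argsort(residuals)
    gap = residuals[order[1]] - residuals[order[0]]  # proxy confidence, not the paper's margin
    return classes[order[0]], gap

def flag_low_confidence(gap, threshold):
    """Mark a prediction for special treatment when its confidence proxy is small."""
    return gap < threshold

# Toy dictionary: two atoms per class in R^4.
s = 1.0 / np.sqrt(2.0)
D = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, s], [0, 0, 0, s]], dtype=float)
atom_labels = np.array([0, 0, 1, 1])
label, gap = src_predict(D, atom_labels, np.array([0.9, 0.1, 0.0, 0.0]), lam=0.01)
needs_review = flag_low_confidence(gap, threshold=0.05)
```

A sample that is equally well explained by two classes produces a gap near zero and would be flagged, which mirrors the small-margin outlier case in Fig. 8(b). A confidently misclassified sample such as the one in Fig. 8(a) would not be caught by any such threshold.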


5.  Conclusion  and  Future  Directions 

An in-depth analysis of the classification margin for SRC
is presented in this paper. We show that the decision boundary
of SRC is a continuous piecewise quadratic hypersurface,
and that it can be approximated by its tangent plane in a
local region to facilitate margin estimation. A learning
algorithm based on stochastic gradient descent is derived to
maximize the margins of training samples; it generates
compact dictionaries with substantially improved discriminative
power, as observed on several data sets.

In future work, we will explore better ways to approximate
the margin of samples far away from the decision
boundary, in the hope of further improving dictionary quality.
It would also be of great interest to establish a strict relationship
between the margin and the generalization performance of
SRC, so that better knowledge can be gained about the
circumstances under which SRC is expected to perform best.